1 Topic Modeling

Topic modeling is an unsupervised method for inferring the abstract topics discussed across a collection of documents. Since the aim of our project is to classify documents, we used topic modeling as a means of labeling our data. Once the data is labeled, unseen test documents are classified based on their topic probabilities.

1.1 Search for optimal topic number

The first step in performing LDA is to determine the optimal number of topics, which we did using the perplexity measure. Since every topic is represented by a probability distribution, we need to measure how well these distributions predict a sample, and perplexity does exactly that: the lower the perplexity, the better the model predicts the documents. Perplexity was computed for LDA objects with k ranging from 10 to 30, for both the Bag of Words model and the TF-IDF model. The LDA object with the lowest perplexity is deemed to be the best model, and its k is deemed to be the optimal number of topics.
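To make the measure concrete, here is a minimal base R sketch of per-token perplexity computed from a document-term matrix and the two LDA distributions. The toy matrices and the function name `perplexity` are illustrative assumptions, not the project's data or the exact routine used in the project.

```r
# Toy example: 2 documents, 3 vocabulary terms, 2 topics (all values illustrative)
dtm <- matrix(c(2, 1, 0,
                0, 1, 3), nrow = 2, byrow = TRUE)       # document-term counts
theta <- matrix(c(0.9, 0.1,
                  0.2, 0.8), nrow = 2, byrow = TRUE)    # P(topic | document)
phi <- matrix(c(0.6, 0.3, 0.1,
                0.1, 0.2, 0.7), nrow = 2, byrow = TRUE) # P(term | topic)

# Hypothetical helper: perplexity = exp(-log likelihood / number of tokens)
perplexity <- function(dtm, theta, phi) {
  word_prob <- theta %*% phi            # P(term | document), one row per document
  log_lik <- sum(dtm * log(word_prob))  # log likelihood of the observed counts
  exp(-log_lik / sum(dtm))              # lower values mean better prediction
}

perplexity(dtm, theta, phi)
```

A model that spreads probability uniformly over the three terms would score a perplexity of exactly 3 here, so any informative model should score lower than that.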

...

In our case, the best model turned out to be the Bag of Words model, with k = 25 topics.

1.2 Model building

The LDA model was built on term frequencies with k = 25 topics. Both Gibbs sampling and the dot product method were used to predict topic probabilities from it.
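As a rough illustration of how a dot product prediction differs from Gibbs sampling, the sketch below projects new documents onto topics with a single matrix product. It shows the general idea only: the toy `phi`, the uniform `topic_prior`, and the term names are assumptions, and the exact steps inside the library may differ.

```r
# Hypothetical fitted quantities for 2 topics and 3 terms (illustrative values only)
phi <- matrix(c(0.6, 0.3, 0.1,
                0.1, 0.2, 0.7), nrow = 2, byrow = TRUE,
              dimnames = list(c("t_1", "t_2"), c("virus", "vaccine", "market")))
topic_prior <- c(0.5, 0.5)  # assumed P(topic); a fitted model would estimate this

# Bayes' rule gives gamma[k, w] = P(topic k | term w)
joint <- phi * topic_prior                     # row k of phi scaled by topic_prior[k]
gamma <- sweep(joint, 2, colSums(joint), "/")  # normalise each term's column to sum to 1

# New documents as term counts
new_dtm <- matrix(c(3, 1, 0,
                    0, 0, 4), nrow = 2, byrow = TRUE,
                  dimnames = list(c("doc_a", "doc_b"), colnames(phi)))

# Dot product projection: no sampling, just a matrix product
term_freq <- new_dtm / rowSums(new_dtm)      # relative term frequencies per document
theta_new <- term_freq %*% t(gamma)          # P(topic | document)
theta_new <- theta_new / rowSums(theta_new)  # guard against rounding drift
```

The projection is deterministic and fast, whereas Gibbs sampling is slower and varies from run to run.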

Model for term frequency with Gibbs sampling

library(textmineR)

# k = 25 is the number of topics selected in Section 1.1
LDA_model_bow <- FitLdaModel(dtm = sparse_matrix_dtm_bow, k = 25L,
                             iterations = 200, burnin = 175)
# Predict topic probabilities by Gibbs sampling
p1_bow <- predict(LDA_model_bow, sparse_matrix_dtm_bow[17:nrow(sparse_matrix_dtm_bow), ],
                  method = "gibbs", iterations = 200, burnin = 175)

Model for term frequency with the dot product method

# Same fit as above, with k = 25 topics
LDA_model_bow <- FitLdaModel(dtm = sparse_matrix_dtm_bow, k = 25L,
                             iterations = 200, burnin = 175)
# The dot product method does no sampling, so iterations and burnin are not needed
p2_bow <- predict(LDA_model_bow, sparse_matrix_dtm_bow[17:nrow(sparse_matrix_dtm_bow), ],
                  method = "dot")

Each document now has a probability for each of the 25 topics. These topic probabilities act as the labelled data for further prediction.
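One simple way to turn topic probabilities into a single label per document is to take the most probable topic. The sketch below assumes a small hypothetical document-topic matrix `theta`; the values and names are illustrative only, and this is not necessarily how labels were derived in the project.

```r
# Hypothetical document-topic matrix: rows are documents, columns are topics
theta <- matrix(c(0.70, 0.20, 0.10,
                  0.15, 0.25, 0.60,
                  0.30, 0.40, 0.30), nrow = 3, byrow = TRUE,
                dimnames = list(paste0("doc_", 1:3), paste0("t_", 1:3)))

# Index of the largest probability in each row, resolved to a topic name
dominant_topic <- colnames(theta)[max.col(theta, ties.method = "first")]

labels <- data.frame(document = rownames(theta),
                     topic = dominant_topic,
                     probability = apply(theta, 1, max),
                     row.names = NULL)
labels
```

Keeping the winning probability next to the label makes it easy to spot documents, such as `doc_3` here, whose dominant topic is only weakly dominant.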

1.3 Prediction

Once we have all the documents in the training set labeled, the next step is predicting the topic probabilities for the unseen test set. The predict method of LDA is used to predict the topic probabilities.

Predicting the topics using Term frequency with Gibbs sampling model

p1_bow <- predict(LDA_model_bow, sparse_matrix_dtm_bow[17:nrow(sparse_matrix_dtm_bow), ],
                  method = "gibbs", iterations = 200, burnin = 175)

The probability distribution of the topics in the train and test sets for the term frequency model with Gibbs sampling can be seen in the plot below.
...

Predicting the topics using the term frequency model with the dot product method

p2_bow <- predict(LDA_model_bow, sparse_matrix_dtm_bow[17:nrow(sparse_matrix_dtm_bow), ],
                  method = "dot")

The probability distribution of the topics in the train and test sets for the term frequency model with the dot product method can be seen in the plot below.

[Figure: probability distribution of the topics in the train and test sets]

1.4 Model evaluation

The next step is to evaluate the models, for which we used log likelihood: the higher the value, the better the model. The plot below shows the log likelihood for the two models.

[Figure: log likelihood of the two models]

From the plot it is evident that the Bag of Words model (term frequency) performs better, so we use this model from here on.

2 Associations & similarities between topics

Once prediction is done, we have topic probabilities for all the documents. To explore similarities between topics, we cluster the documents based on their topic probabilities.

2.1 Finding optimal number of clusters

2.1.1 Elbow curve

To perform clustering, we first need to decide on the number of clusters. We determined this using an elbow curve, which suggested 8 as the optimal number of clusters.

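library(Rtsne)

# Reduce the topic probabilities to two dimensions via t-SNE
# (the first column of doc_topics_gamma is dropped)
set.seed(1)  # t-SNE and k-means are both randomised; the seed value is arbitrary
tsne <- Rtsne(doc_topics_gamma[, -1], perplexity = 30, pca = FALSE, check_duplicates = FALSE)
X <- data.frame(tsne$Y)

# Within-groups sum of squares for 1 to 100 clusters (25 topics)
wss <- numeric(100)
for (i in 1:100) wss[i] <- sum(kmeans(X, centers = i, iter.max = 50L)$withinss)
plot(1:100, wss, type = "b", xlab = "Number of Clusters", ylab = "Within groups sum of squares")

[Figure: elbow curve of within-groups sum of squares against number of clusters]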

2.1.2 Silhouette coefficient

Another approach we used to find the optimal number of clusters was the silhouette coefficient. For each point, it compares the mean distance to the other points in its own cluster (intra-cluster distance) with the mean distance to the points in the nearest other cluster (inter-cluster distance). We evaluated this value for 8 and 15 clusters, and the results can be seen in the plots below.

[Figures: silhouette plots for 8 and 15 clusters]

The silhouette coefficient for our clustering was 0.33. Since all the documents in our dataset are about coronavirus, it is no surprise that the value is low: the clusters, and thus the documents within them, lie close to one another.
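For reference, the average silhouette coefficient can be computed in a few lines of base R. The sketch below uses toy two-dimensional data and a hypothetical helper `mean_silhouette`; it is not the project's data or code, and in practice `silhouette()` from the `cluster` package does the same job.

```r
set.seed(42)
# Toy 2-D data: two well-separated groups (illustrative, not the project's data)
X <- rbind(matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
           matrix(rnorm(40, mean = 4, sd = 0.5), ncol = 2))
cl <- kmeans(X, centers = 2, nstart = 5)$cluster

# Hypothetical helper: mean silhouette width over all points
mean_silhouette <- function(X, cl) {
  d <- as.matrix(dist(X))
  s <- vapply(seq_len(nrow(X)), function(i) {
    own <- cl == cl[i]
    if (sum(own) == 1) return(0)                  # singleton cluster: defined as 0
    a <- sum(d[i, own]) / (sum(own) - 1)          # mean distance within own cluster
    b <- min(tapply(d[i, !own], cl[!own], mean))  # mean distance to nearest other cluster
    (b - a) / max(a, b)
  }, numeric(1))
  mean(s)
}

mean_silhouette(X, cl)
```

Values close to 1 indicate compact, well-separated clusters, while values near 0 indicate overlapping clusters.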

2.2 Clustering

Finally, the articles were grouped into 8 clusters.

library(factoextra)

# k-means with 8 clusters on the t-SNE coordinates
k3 <- kmeans(X, centers = 8, nstart = 5, iter.max = 100000L)

fviz_cluster(k3, X)

[Figure: Convex Hull Plot for 8 clusters]

2.3 Topics association to clusters

It is also interesting to see how the topics are associated with the clusters. The chord diagram shows the association of each topic with the clusters.

[Figure: chord diagram of the association between topics and clusters]

The line chart represents the behavior of each topic in each of the clusters.

[Figure: line chart of topic behavior across the clusters]
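One way to quantify the association between topics and clusters is the mean topic probability within each cluster. The sketch below assumes a hypothetical document-topic matrix `theta` and cluster assignments `cluster_id`; in the project these would presumably come from the topic probabilities and the k-means result, but the values here are illustrative only.

```r
# Hypothetical inputs: topic probabilities for 6 documents and their cluster ids
theta <- matrix(c(0.8, 0.1, 0.1,
                  0.7, 0.2, 0.1,
                  0.1, 0.8, 0.1,
                  0.2, 0.7, 0.1,
                  0.1, 0.1, 0.8,
                  0.2, 0.2, 0.6), ncol = 3, byrow = TRUE,
                dimnames = list(NULL, paste0("t_", 1:3)))
cluster_id <- c(1, 1, 2, 2, 3, 3)

# Sum the topic probabilities per cluster, then divide by the cluster sizes.
# Rows are clusters, columns are topics; each row sums to 1.
topic_by_cluster <- rowsum(theta, group = cluster_id) / as.vector(table(cluster_id))
topic_by_cluster
```

A matrix of this shape is the kind of input a chord diagram or a per-cluster line chart is typically drawn from.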

3 Visualization of the corpus

The entire document corpus has been visualized in an rbokeh graph. Hovering over a document displays its title, URL, and most dominant topic. Documents belonging to the same topic lie relatively close to each other, although some exceptions exist near the boundaries of each topic.

Similarly, we plotted the articles from the test and train sets in rbokeh, in order to view the performance of our model. From the plot it can be seen that the documents in the test set lie in the same region as the training-set documents that share their predicted topic.

The pie chart below gives the distribution of the topics dominant in each article of the whole corpus.

4 Topic term Association

The section below shows the association between the top terms in the documents and the topics generated from them.

[Figure: Terms in Topics]

5 Topic progression

Finally, the progression of the topics that were mainly discussed during the initial months of the pandemic is displayed below.

6 Conclusion

The news articles published in the past few months discussed different aspects of coronavirus. A progression of topics from January to April was evident. Among our models, LDA on the Bag of Words (term frequency) representation performed better than the others. Our model was successful in predicting topic distributions for test articles. We were therefore able to find similar articles, given an unseen test article.
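As an illustration of how similar articles can be retrieved from topic distributions, the sketch below ranks training articles by their Hellinger distance to a test article. The choice of Hellinger distance, the toy distributions, and all names are assumptions made for this example; the text above does not state which similarity measure was used.

```r
# Hypothetical topic distributions (rows sum to 1); not taken from the project's data
train_theta <- matrix(c(0.80, 0.15, 0.05,
                        0.10, 0.80, 0.10,
                        0.05, 0.15, 0.80), nrow = 3, byrow = TRUE,
                      dimnames = list(paste0("train_", 1:3), paste0("t_", 1:3)))
test_theta <- c(t_1 = 0.10, t_2 = 0.20, t_3 = 0.70)

# Hellinger distance between two discrete distributions: 0 = identical, 1 = disjoint
hellinger <- function(p, q) sqrt(sum((sqrt(p) - sqrt(q))^2) / 2)

# Distance from the test article to every training article
distances <- apply(train_theta, 1, hellinger, q = test_theta)
most_similar <- names(which.min(distances))
most_similar
```

Sorting `distances` instead of taking only the minimum would give a ranked list of related articles.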